Tuning SMT with a Large Number of Features via Online Feature Grouping

نویسندگان

  • Lemao Liu
  • Tiejun Zhao
  • Taro Watanabe
  • Eiichiro Sumita
چکیده

In this paper, we consider the tuning of statistical machine translation (SMT) models employing a large number of features. We argue that existing tuning methods for these models suffer serious sparsity problems, in which features appearing in the tuning data may not appear in the testing data and thus those features may be over tuned in the tuning data. As a result, we face an over-fitting problem, which limits the generalization abilities of the learned models. Based on our analysis, we propose a novel method based on feature grouping via OSCAR to overcome these pitfalls. Our feature grouping is implemented within an online learning framework and thus it is efficient for a large scale (both for features and examples) of learning in our scenario. Experiment results on IWSLT translation tasks show that the proposed method significantly outperforms the state of the art tuning methods.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Online Streaming Feature Selection Using Geometric Series of the Adjacency Matrix of Features

Feature Selection (FS) is an important pre-processing step in machine learning and data mining. All the traditional feature selection methods assume that the entire feature space is available from the beginning. However, online streaming features (OSF) are an integral part of many real-world applications. In OSF, the number of training examples is fixed while the number of features grows with t...

متن کامل

Joint Feature Selection in Distributed Stochastic Learning for Large-Scale Discriminative Training in SMT

With a few exceptions, discriminative training in statistical machine translation (SMT) has been content with tuning weights for large feature sets on small development data. Evidence from machine learning indicates that increasing the training sample size results in better prediction. The goal of this paper is to show that this common wisdom can also be brought to bear upon SMT. We deploy loca...

متن کامل

Hyperspectral Image Classification Based on the Fusion of the Features Generated by Sparse Representation Methods, Linear and Non-linear Transformations

The ability of recording the high resolution spectral signature of earth surface would be the most important feature of hyperspectral sensors. On the other hand, classification of hyperspectral imagery is known as one of the methods to extracting information from these remote sensing data sources. Despite the high potential of hyperspectral images in the information content point of view, there...

متن کامل

A Fall Detection System based on the Type II Fuzzy Logic and Multi-Objective PSO Algorithm

The Elderly health is an important and noticeable issue; since these people are priceless resources of experience in the society. Elderly adults are more likely to be severely injured or to die following falls. Hence, fast detection of such incidents may even lead to saving the life of the injured person. Several techniques have been proposed lately for the fall detection of people, mostly cate...

متن کامل

A hybrid filter-based feature selection method via hesitant fuzzy and rough sets concepts

High dimensional microarray datasets are difficult to classify since they have many features with small number ofinstances and imbalanced distribution of classes. This paper proposes a filter-based feature selection method to improvethe classification performance of microarray datasets by selecting the significant features. Combining the concepts ofrough sets, weighted rough set, fuzzy rough se...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2013